Feature #3: Find Dictionary

Description#

We use a network protocol that encrypts all application messages using a proprietary scheme. The encryption scheme has a unique property that the sequence of encrypted messages in a session appears to be in sorted order according to a secret dictionary. However, the dictionary is not transmitted for security purposes.

Before the sender starts transmitting actual messages, it sends several encrypted training messages to the receiver. The sender guarantees that the training messages will follow the lexicographic order according to the unknown dictionary.

The receiver must reverse engineer the training messages and generate the dictionary for future communication with the sender. If the order of the messages is invalid, the receiver generates an empty dictionary and asks the sender to retransmit the training messages.

For simplicity’s sake, we can assume that the encrypted contents of the messages only consist of English lowercase letters.

Let’s review a few examples below:

Created with Fabric.js 3.6.6
A valid sequence of encrypted messages

1 of 2

Created with Fabric.js 3.6.6 The ordering of the message is invalid, as "w" can be eithergreater than "h" or less than "h", but not both at the same time.
A potentially tampered sequence of encrypted messages

2 of 2

Solution#

We can essentially map this problem to a graph problem but, before exploring the exact details of the solution, there are a few things that we need to keep in mind:

  • The letters within a message don’t tell us anything about the relative order. For example, the message educative in the list does not tell us that the letter e is before the letter d.

  • The input can contain messages followed by their prefix, for example, educated and then educate. These cases will never result in a valid alphabet (because in a valid alphabet, prefixes are always first). We’ll need to make sure our solution detects these cases correctly.

  • There can be more than one valid alphabet ordering. It is fine for our algorithm to return any one of them.

  • The output dictionary must contain all unique letters within the messages’ list, including those that could be in any position within the ordering. It should not contain any additional letters that were not in the input.

Now back to the graph problem part, we can break this particular problem into three parts:

  1. Extract the necessary information to identify the dependency rules from the messages. For example, in the messages provided in the slides above, ["decode", "interview"], the letter d comes before i.

  2. With the gathered information, we can put these dependency rules into a directed graph with the letters as nodes and the dependencies (order) as the edges.

  3. Lastly, we can sort the graph nodes topologically to generate the letter ordering (dictionary).

Let’s look at each part in more depth.

Part-1: Identifying the dependencies#

Let’s start with an example of encrypted training messages and observe the initial ordering through simple reasoning:

Encrypted Training Messages

As in the English language dictionary, where all the words starting with a come at the start followed by the words starting with b, c, d, and so on, we can expect the first letters of each message to be in alphabetical order as well.

First Letters

Removing the duplicates, we get:

First Letters without Duplicates

The following illustration might clarify this step:

mzosr
mzosr
mqov
mqov
xxsvq
xxsvq
xazv
xazv
xazau
xazau
xaqu
xaqu
suvzu
suvzu
suvxq
suvxq
suam
suam
suax
suax
rom
rom
rwx
rwx
rwv
rwv
m
m
m
m
x
x
x
x
x
x
x
x
s
s
s
s
s
s
s
s
r
r
r
r
r
r
m
m
x
x
s
s
r
r
Following the intuition explained above, we can assume 
that the first letters in the messages are in alphabetical order.
Following the intuition explained above, we can assume...
Removing the duplicates
Removing the duplicates
Viewer does not support full SVG 1.1
First letter order

Looking at the letters above, we know the relative order of these letters but do not know how these letters fit in with the rest of the letters. To get more information, we’ll need to look further into our English dictionary analogy. The word dirt comes before dorm. This is because we look at the second letter when the first letter is the same. In this case, i comes before o in the alphabet.

We can apply the same logic to our encrypted messages and look at the first two messages, mzsor and mqov. As the first letter is the same in both the messages, we look at the second letter. The first message has z, and the other second one has q. Therefore, we can safely say that z comes before q. We now have two fragments of the letter-order:

Second letter order

We don’t know yet how these two fragments could fit together into a single ordering. For example, we don’t know whether m is before q, or q is before m, or even whether or not there’s enough information available in the input for us to know.

Anyway, we’ve now gotten all the information we can out of the first two words. All letters after z in mzosr, and after q in mqov, can be ignored because they do not impact the relative ordering of the two words. To better understand this, we can think back to dirt and dorm. Because i > o, the rt and rm parts are unimportant for determining alphabetical ordering.

Hopefully, we can see a pattern here. When two messages are adjacent, we need to look for the first difference between them. That difference tells us the relative order between two letters. Let’s have a look at all the relations we can extract by comparing adjacent messages:

mzosr
mzosr
mqov
mqov
xxsvq
xxsvq
xazv
xazv
xazau
xazau
xaqu
xaqu
suvzu
suvzu
suvxq
suvxq
suam
suam
suax
suax
rom
rom
rwx
rwx
mqov
mqov
xxsvq
xxsvq
xazv
xazv
xazau
xazau
xaqu
xaqu
suvzu
suvzu
suvxq
suvxq
suam
suam
suax
suax
rom
rom
rwx
rwx
rwv
rwv
z -> q
z -> q
m -> x
m -> x
x -> a
x -> a
v -> a
v -> a
z -> q
z -> q
x -> s
x -> s
z -> x
z -> x
v -> a
v -> a
m -> x
m -> x
s -> r
s -> r
o -> w
o -> w
x -> v
x -> v
messages[i]
messages[i]
messages[i + 1]
messages[i + 1]
inferred rule
inferred rule
Viewer does not support full SVG 1.1
Dependencies

Notice that we did not mention rules such as m -> a. This is fine because we can derive this relation from m -> x, x -> a.

This is it for the first part. Let’s put the pieces that we have in place.

Part-2: Representing the dependencies#

We now have a set of relations mentioning the relative order of the pairs of letters:

Dependency List

Now the question arises, how can we put these relations together? One might be tempted to start chaining all these together. Let’s look at a few possible chains:

m - > x
m - > x
x -> s
x -> s
s -> r
s -> r
m -> x -> s -> r
m -> x -> s -> r
m -> x
m -> x
x -> a
x -> a
m -> x -> a
m -> x -> a
z -> x
z -> x
x -> s
x -> s
s -> r
s -> r
z -> x -> s -> r
z -> x -> s -> r
o ->  w
o ->  w
o -> w
o -> w
Viewer does not support full SVG 1.1
Possible chains

As we can observe from our chains above that some letters might appear in more than one chain and putting the chains into the output list one after the other won’t work. Some of the letters might be duplicated and would result in an invalid ordering.

Let’s try to visualize the relations better with the help of a graph. The nodes are the letters, and an edge between two letters, x and y represents that x is before y in the encrypted messages.

g x x v v x->v a a x->a s s x->s u u q q z z z->x z->q m m m->x o o w w o->w v->a r r s->r
Graph representation

Part-3: Generating the dictionary#

As we can see from the graph, four of the letters have no incoming arrows. What this means is that there are no letters that have to come before any of these four.

Remember that there could be multiple valid dictionaries, and if there are, then it’s fine for us to return any of them.

Therefore, a valid start to the ordering we return would be:

["o", "m", "u", "z"]

We can now remove these letters and edges from the graph because any other letters that required them first will now have this requirement satisfied.

g x x v v x->v a a x->a s s x->s q q w w v->a r r s->r
Remove the first order letters

On this new graph, there are now three new letters that have no in-arrows. We can add these to our output list.

["o", "m", "u", "z", "x", "q", "w"]

Again, we can remove these from the graph.

g v v a a v->a s s r r s->r
Remove the second order letters

Then add the two new letters with no in-arrows.

["o", "m", "u", "z", "x", "q", "w", "v", "s"]

This leaves the following graph:

g a a r r
Remove the remaining letters

We can place the final two letters in our output list and return the ordering:

["o", "m", "u", "z", "x", "q", "w", "v", "s", "a", "r"]

Let’s now review how we can implement this approach next.

Algorithm#

Identifying the dependencies and representing them in the form of a graph is pretty straightforward. We extract the relations and insert them into an adjacency list:

z
z
q
q
u
u
x
x
m
m
s
s
r
r
a
a
v
v
w
w
o
o
[q, x]
[q, x]
[ ]
[ ]
[ ]
[ ]
[s, a, v]
[s, a, v]
[x]
[x]
[r]
[r]
[ ]
[ ]
[ ]
[ ]
[a]
[a]
[ ]
[ ]
[w]
[w]
Viewer does not support full SVG 1.1
Adjacency list

Next, we need to generate the dictionary from the extracted relations: identify the letters (nodes) with no incoming links. Identifying whether a particular letter (node) has any incoming links or not from our adjacency list format can be a little complicated. A naive approach would be to repeatedly iterate over the adjacency lists of all the other nodes and check whether or not they contain a link to that particular node.

This naive method would be fine for our case, but perhaps we can do it more optimally.

An alternative is to keep two adjacency lists:

  • One with the same contents as the one above
  • One reversed that shows the incoming links

This way, every time we traverse an edge, we can remove the corresponding edge from the reversed adjacency list:

z
z
q
q
u
u
x
x
m
m
s
s
r
r
a
a
v
v
w
w
o
o
[ ]
[ ]
[z]
[z]
[ ]
[ ]
[m, z]
[m, z]
[ ]
[ ]
[x]
[x]
[s]
[s]
[x, v]
[x, v]
[x]
[x]
[o]
[o]
[ ]
[ ]
Viewer does not support full SVG 1.1
Reverse adjacency list

What if we told you that we could do better than this? Instead of tracking the incoming links for all the letters from a particular letter, we can track the count of how many incoming edges there are. We can keep the indegree count of all the letters along with the forward adjacency list.

Indegree corresponds to the number of incoming edges of a node.

Let’s see how that might look like below:

z
z
q
q
u
u
x
x
m
m
s
s
r
r
a
a
v
v
w
w
o
o
0
0
1
1
0
0
2
2
0
0
1
1
1
1
2
2
1
1
1
1
0
0
Viewer does not support full SVG 1.1
Indegree counts

Now, we can decrement the indegree count of a node instead of removing it from the reverse adjacency list. When the indegree of the node reaches 0, this represents that this particular node has no incoming links left.

We perform BFS on all the letters that are reachable, i.e., the indegree count of the letters is zero. A letter is only reachable once the letters that need to be before it have been added to the output, result.

We use a queue to keep track of reachable nodes and perform BFS on them. Initially, we put the letters that have zero indegree count. We keep adding the letters to the queue as their indegree counts become zero.

We continue this until the queue is empty. Next, we check whether all the letters in the messages have been added to the output or not. This would only happen when some letters still have some incoming edges left, which means there is a cycle. In this case, we return "".

Remember that there can be letters that do not have any incoming edges. This can result in different orderings for the same set of messages, and that’s alright.

Let’s try to visualize the algorithm with the help of a set of slides below:

Created with Fabric.js 3.6.6 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] Adjacency list Indegree count z 0 q 1 u 0 x 2 m 0 s 1 r 1 a 2 v 1 w 1 o 0 Output Queue z u m o
Identify the relations and initialize the data structures

1 of 19

Created with Fabric.js 3.6.6 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] Adjacency list Indegree count Output Queue z 0 q 1 u 0 x 2 m 0 s 1 r 1 a 2 v 1 w 1 o 0 z u m o z
Traverse the graph using BFS

2 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z u m o q z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z 0 q 0 u 0 x 1 m 0 s 1 r 1 a 2 v 1 w 1 o 0
Traverse the graph using BFS

3 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z 0 q 0 u 0 x 1 m 0 s 1 r 1 a 2 v 1 w 1 o 0 u m o q z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z u
Traverse the graph using BFS

4 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z 0 q 0 u 0 x 1 m 0 s 1 r 1 a 2 v 1 w 1 o 0 m o q z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z u m
Traverse the graph using BFS

5 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue o q x z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z 0 q 0 u 0 x 0 m 0 s 1 r 1 a 2 v 1 w 1 o 0 z u m
Traverse the graph using BFS

6 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue o q x z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z 0 q 0 u 0 x 0 m 0 s 1 r 1 a 2 v 1 w 1 o 0 z u m o
Traverse the graph using BFS

7 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue q x w z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z 0 q 0 u 0 x 0 m 0 s 1 r 1 a 2 v 1 w 0 o 0 z u m o
Traverse the graph using BFS

8 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z u m o q z 0 q 0 u 0 x 0 m 0 s 1 r 1 a 2 v 1 w 0 o 0 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] q x w
Traverse the graph using BFS

9 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z 0 q 0 u 0 x 0 m 0 s 1 r 1 a 2 v 1 w 0 o 0 x w z u m o q x z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w]
Traverse the graph using BFS

10 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z u m o q x z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] w s v z 0 q 0 u 0 x 0 m 0 s 0 r 1 a 1 v 0 w 0 o 0
Traverse the graph using BFS

11 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z 0 q 0 u 0 x 0 m 0 s 0 r 1 a 1 v 0 w 0 o 0 w s v z u m o q x w
Traverse the graph using BFS

12 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z 0 q 0 u 0 x 0 m 0 s 0 r 1 a 1 v 0 w 0 o 0 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] s v z u m o q x w s
Traverse the graph using BFS

13 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z u m o q x w s v r z 0 q 0 u 0 x 0 m 0 s 0 r 0 a 1 v 0 w 0 o 0 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w]
Traverse the graph using BFS

14 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] v r z u m o q x w s v z 0 q 0 u 0 x 0 m 0 s 0 r 0 a 1 v 0 w 0 o 0
Traverse the graph using BFS

15 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue r a z 0 q 0 u 0 x 0 m 0 s 0 r 0 a 0 v 0 w 0 o 0 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z u m o q x w s v
Traverse the graph using BFS

16 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue r a z 0 q 0 u 0 x 0 m 0 s 0 r 0 a 0 v 0 w 0 o 0 z u m o q x w s v r z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w]
Traverse the graph using BFS

17 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z 0 q 0 u 0 x 0 m 0 s 0 r 0 a 0 v 0 w 0 o 0 a z u m o q x w s v r a z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w]
Traverse the graph using BFS

18 of 19

Created with Fabric.js 3.6.6 Adjacency list Indegree count Output Queue z 0 q 0 u 0 x 0 m 0 s 0 r 0 a 0 v 0 w 0 o 0 z [q, x] q [ ] u [ ] x [s, a, v] m [x] s [r] r [ ] a [ ] v [a] w [ ] o [w] z u m o q x w s v r a
The final dictionary

19 of 19

Let’s review the actual implementation code below:

Find dictionary

Complexity measures#

Time Complexity Space Complexity
O(c)O(c) O(1)O(1)

Let nn be the total number of messages in the input list.

Let cc be the total length of all the messages in the input list, added together.

Let uu be the total number of unique letters in the messages. While this is limited to 2626 in our case, we’ll still look at how it would impact the complexity if this was not the case.

Time complexity#

There are three parts to the algorithm:

  • identifying all the relations
  • putting them into an adjacency list
  • converting it into a valid alphabet ordering

In the worst case, the identification and initialization parts require checking every letter of every word, which is O(c)O(c).

For the generation part, we can recall that a breadth-first search has a cost of O(v+e)O(v + e), where vv is the number of vertices, and ee is the number of edges. Our algorithm has the same cost as BFS, as it too is visiting each edge and node once.

A node is visited once all of its edges are visited, unlike the traditional BFS where it is visited once any edge is visited.

Therefore, determining the cost of our algorithm requires determining how many nodes and edges there are in the graph.

Nodes: We know that there is one vertex for each unique letter, i.e., O(u)O(u) vertices.

Edges: We generate each edge in the graph by comparing two adjacent messages in the input list. There are n1n - 1 pairs of adjacent words, and only one edge can be generated from each pair. We can again look back at the English dictionary analogy to make sense of this:

"dirt"
"dorm"

The only conclusion we can draw is that i is before o. This is the reason dirt appears before dorm in an English dictionary. As explained in the solution, the remaining letters rt and rm are irrelevant for determining the alphabetical ordering.

Remember that we only generate rules for adjacent messages and do not add the “implied” rules to the adjacency list.

So with this, we know that there are at most n1n - 1 edges.

We can place one additional upper limit on the number of edges as it is impossible to have more than one edge between each pair of nodes. With uu nodes, this means there can’t be more than u2u^2 edges.

As the number of edges has to be lower than both n1n - 1 and u2u^2, we know it is at most the smallest of these two values: min(u2,n1)min(u^2, n - 1).

We can now substitute the number of nodes and the number of edges in our breadth-first search cost:

  • v=uv = u
  • e=min(u2,n1)e = min(u^2, n - 1)

This gives us:

O(v+e)=O(u+min(u2,n1))=O(u+min(u2,n))O(v + e) = O(u + min(u^2, n - 1)) = O(u + min(u^2, n))

Finally, we combine the two parts: O(c)O(c) for the first two parts and O(u+min(u2,n))O(u + min(u^2, n)) for the third part. As we have two independent parts, we can add them and look at the final formula to see whether or not we can identify any relation between them. Combining them, we get:

O(c)+O(u+min(u2,n))=O(c+u+min(u2,n))O(c) + O(u + min(u^2, n)) = O(c + u + \min(u^2, n))

So, what do we know about the relative values of nn, cc, and uu? We can deduce that both nn, the total number of messages, and uu, the total number of unique letters are smaller than the total number of letters, cc, as each message contains at least one character and there can’t be more unique characters than there are characters.

Now, we know that cc is the biggest of the three, but we do not know the relation between nn and uu.

Let’s simplify our formulation a little as we know that the uu bit is insignificant compared to the cc:

O(c+u+min(u2,n))>O(c+min(u2,n))O(c + u + min(u^2, n)) -> O(c + min(u^2, n))

Let’s now consider two cases to simplify it a little further:

  • If u2u^2 is smaller than nn, then min(u2,n)=u2min(u^2, n) = u^2. We have already established that u2u^2 is smaller than nn, which is, in turn, smaller than cc, and so u2u^2 is definitely less than cc. This leaves us with O(c)O(c).

  • If u2u^2 is larger than nn, then min(u2,n)=nmin(u^2, n) = n. Because c>nc > n, we’re left with O(c)O(c).

So in all cases, we know that c>min(u2,n)c > min(u^2, n). This gives us a final time complexity of O(c)O(c).

Space complexity#

The space complexity is O(1)O(1) or O(u+min(u2,n))O(u + min(u^2, n)).

The adjacency list uses O(v+e)O(v + e) memory, which in the worst case is min(u2,n)min(u^2, n), as explained in the time complexity analysis.

So in total, the adjacency list takes O(u+min(u2,n))O(u + min(u^2, n)) space. Hence, the space complexity for a large number of letters is O(u+min(u2,n))O(u + min(u^2, n)).

But, for our use case, where uu is fixed at a maximum of 2626, the space complexity is O(1)O(1). This is because uu is fixed at 2626, and the number of relations is fixed at 26226^2, so O(min(262,n))=O(262)=O(1)O(min(26^2, n)) = O(26^2) = O(1).

Feature #2: Verify Message Integrity

Feature #4: Ways to Decode Message